Adverse drug event detection using natural language processing: A scoping review of supervised learning methods

To reduce adverse drug events (ADEs), hospitals need a system to support them in monitoring ADE occurrence routinely, rapidly, and at scale. Natural language processing (NLP), a computerized approach to analyze text data, has shown promising results for the purpose of ADE detection in the context of pharmacovigilance. However, a detailed qualitative assessment and critical appraisal of NLP methods for ADE detection in the context of ADE monitoring in hospitals is lacking. Therefore, we have conducted a scoping review to close this knowledge gap, and to provide directions for future research and practice. We included articles where NLP was applied to detect ADEs in clinical narratives within electronic health records of inpatients. Quantitative and qualitative data items relating to NLP methods were extracted and critically appraised. Out of 1,065 articles screened for eligibility, 29 articles met the inclusion criteria. Most frequent tasks included named entity recognition (n = 17; 58.6%) and relation extraction/classification (n = 15; 51.7%). Clinical involvement was reported in nine studies (31%). Multiple NLP modelling approaches seem suitable, with Long Short Term Memory and Conditional Random Field methods most commonly used. Although reported overall performance of the systems was high, it provides an inflated impression given a steep drop in performance when predicting the ADE entity or ADE relation class. When annotating corpora, treating an ADE as a relation between a drug and non-drug entity seems the best practice. Future research should focus on semi-automated methods to reduce the manual annotation effort, and examine implementation of the NLP methods in practice.


Introduction
Adverse drug events (ADEs) represent a significant clinical problem in healthcare, owing to the increasing multimorbidity and complexity of medical treatement. Therefore, improving medication safety has been set as a global patient safety challenge, with a goal to reduce the level of severe, avoidable harm related to medication by 50% over 5 years [1]. Since the pooled prevalence of ADEs in the hospital setting is twice as hight as the pooled prevalence in primary care (19% versus 8%) [2,3], we focus on this more vulnerable patient population. In order to improve medication safety in hospitalized patients, hospitals need to have accurate and continuous insight into what type of ADEs occur in their inpatients including which subpopulations are at high ADE risk. Such information is crucial in order to gain better understanding of the medication, patients, and clinical processes that are most amenable to medication safety interventions and on which of these to focus their efforts. One of the major barriers for gaining such insight is lack of a monitoring system that can routinely, rapidly and at scale detect ADEs in hospitalized patients [4]. Such a system would help to obtain information about ADEs that have occurred in hospitalized patients. Subsequently, this information could be used to predict ADE occurrence in future inpatients, supporting clinicians in timely ADE recognition. At present, most hospitals rely on voluntary reporting of ADEs by healthcare staff, yet numerous studies have shown that this approach detects less than 1% of all ADEs [5]. The more comprehensive ADE identification methodpatient chart review by pharmacists-can identify up to 20 times more ADE but is prohibitively expensive and time-consuming [6,7].
The widespread adoption of electronic health record (EHR) systems has led to repositories of digital patient data, creating the potential to use information technology to generate computerized ADE monitoring systems in hospitals for routine, rapid and continuous analysis of the vast amounts of data [8]. However, since most information about ADEs tend to be registered in EHRs as free text mentions in clinical narratives (such as progress notes or discharge letters), extensive processing and formatting of this data is needed in order for a computer to accurately analyse it [9,10]. The use of natural language processing (NLP) may help to address these challenges.
NLP is a domain of computer science that uses computers to manipulate free text data in the context of a specific task [11]. NLP has been investigated in the clinical domain for a range of tasks, from extracting information on medication dosage to classifying cancer staging from pathology reports [12]. Regarding the task of detecting ADEs, the majority of NLP efforts focus on pharmacovigilance [9,11]. Two recent literature reviews, one systematic and one narrative, on this topic have provided a strategic overview of the progress that has been made with NLP on pharmacovigilance using EHR data, as well as the challenges pertaining to such a task [9,11]. Challenges highlighted in these studies include limited data sharing between healthcare organizations and in detecting ADEs that arise from polypharmacy (drug-drug interactions) [9,11]. In addition, a recent scoping review on key use cases for artificial intelligence to reduce the frequency of ADEs, promising NLP applications are presented [13]. However, these reviews lack a detailed description of the steps needed to apply NLP for ADE detection using EHR data in the context of ADE monitoring in hospitalized patients, including critical appraisal of NLP methods used.
Furthermore, most previous studies on ADE detection using NLP have investigated detection of separate clinical entities such as diagnoses, drug names and associated attributes such as dose, route, frequency, or looked at ADEs in the context of pharmacovigilance and post-market surveillance using predominantly spontaneous reporting databases. However, when detecting ADE mentions in clinical notes, both the drug and adverse event must be detected as well as the causality that links them. This causal element is missing when searching for separate entities. This complexity is often overlooked. In addition, spontaneous reporting databases include data which differ greatly from clinical narratives in EHR databases. Therefore, we have conducted a scoping review to close this knowledge gap. Our aim is to examine the use of NLP methods in detecting ADE mentions in clinical notes in order to improve medication safety in hospitalized patients. We examine supervised learning methods since these are the most common type of machine learning applied in the medical domain. Focusing on the hospital setting enables a better comparison of the NLP methods presented in our scoping review. This work also includes a structured framework and critical appraisal of the included studies. The results give insight into strengths and limitations of current NLP applications for the task of ADE detection in hospitalized patients, and provide guidance on how to move forward to create NLP-based systems fit for purpose of monitoring medication safety in hospitals. Overall, this paper aims to serve as a reference point for both data scientists, clinicians and pharmacists as well as for decisionmakers in the clinical medication safety domain, particularly from the methodological point of view.

Approach
Our approach for conducting the scoping review is based on a set of recommendations outlined by Arksey and O'Malley [14] and the additional recommendations on this framework proposed by Levac et al. [15]. We further implemented recommendations specific to methodology scoping reviews, including identification of search terms, iterative search technique, and features to extract [16]. For reporting we have used the Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) checklist [17]. The checklist can be found in S1 Checklist. The corresponding PRISMA flow diagram is shown in Fig 1 and further explained in Section 3.1.
To create a framework for the review and critical appraisal of the included articles, we have used the Cross-industry standard process for data mining (CRISP-DM) [18] as a reference model to describe the stages and steps of the workflow to use NLP for ADE detection in hospitalized patients. This framework for NLP workflow is depicted in Fig 2. Box 1 provides a glossary of NLP terms used in this review.

Information sources
We used two types of information sources. The first type of source was peer-reviewed literature databases (MEDLINE and EMBASE). The search in these databases was most recently conducted on 1 st July 2021.
The second type of source was an open access archive and pre-print server for scholarly articles in the fields of computer science, statistics, and quantitative biology (among others), named arXiv. Articles submitted to arXiv are not peer-reviewed. This source was included to look for state-of-the-art technical literature. This search was most recently conducted on 1 st July 2021.

Search strategy
The search strategy was centred on three key themes, namely natural language processing, clinical narratives, and adverse drug events. Our choice of search terms was made following testing of individual search terms and consulting other reviews in NLP of clinical narratives [12,25]. We did not employ any date or other types of filters to the search queries. The full search strategy is available in S1 File.
in the screening process [27]. The screening was conducted by RMM with MCS reviewing 20% of the decisions to ensure that concordance was achieved. MCS was not blinded to the decisions made by RMM. Where decisions differed or were uncertain, JEK made the final decision.
The full text of the remaining articles was then reviewed to identify those that matched the inclusion and exclusion criteria. The full text review was performed by RMM, with a 10% sample blind reviewed by JHL. Where decisions differed or were uncertain, JEK made the final decision.

Eligibility criteria
To keep the information retrieved relevant and specific to our aim of detecting ADE mentions in clinical notes in order to improve medication safety in hospitalized patients, we chose to exclude articles that use clustering methods to find patterns of ADEs in clinical notes-this is commonly done with unsupervised learning. The output of such methods is primarily groups of patients with a distinguishing adverse event. Such output does not align with our aim. Similarly, we also chose to exclude articles that used data from the primary care setting and data from pharmacovigilance sources, as this data differs in several ways from inpatient clinical notes. Pharmacovigilance data, such as spontaneous reporting system data or drug labels, tend to have more formal and structured language, and the data is documented for a different purpose (signal detection or information dissemination rather than communicating with colleagues). The spontaneous reporting system data can be semi-structured (i.e. the drug name may be selected from a predefined list), unlike clinical notes where reference to the drug could be given by the brand name, drug name, drug group, or simply an abbreviation (e.g. "AKI due to AB", meaning "acute kidney injury due to antibiotics"). Notably, the purpose is to report an ADE; therefore each spontaneous report should contain an ADE mention. In contrast, most inpatient clinical notes will not contain an ADE mention. Primary care and hospital care narratives differ greatly in terms of their frequency, structure, style, and language used, which Box 1. Glossary of natural language processing and technical terms used in this review.

Supervised machine learning
Task of learning a model from labeled training data consisting of a set of training examples.

Unsupervised machine learning
Task of learning from data that does not have any labels; therefore it seeks patterns that naturally occur in the dataset.

Corpus
A collection of texts forming a dataset.

Sentence boundary detection
A technique to determine where one sentence ends and the next sentence begins; also known as sentence segmentation.

Tokenization
Breaking down text into units known as tokens; a token may be a word, part of word, or punctuation.

Part-of-speech (POS) tagging
Categorizing tokens in a text with their corresponding part of speech, such as noun, verb, adjective, et cetera.

Annotation
The process of applying predefined labels or categories to text data; can be performed at the level of documents, sentences, phrases, or words. At the document or sentence level, labels are typically binary, for example: presence or absence of an ADE mention in the text; at the word or phrase level assigned labels are commonly multi-category (see Entity and Relation).

Entity
A term or phrase in the text representing information to be extracted; in the clinical context this often includes diagnoses ("acute myocardial infarction"), symptoms ("fever"), and drug names ("morphine").

Named entity recognition (NER)
A task of identifying and classifying entities in the text using a model.

Named entity normalization (NEN)
Also called medical concept normalization, this is the task of matching entities to concepts in a medical terminology such as SNOMED-CT or ICD-10.

Entity attribute labelling
A task of identifying and classifying attributes of entities in the text using a model; sometimes called concept attribute labelling.

Relation
A relationship that exists between two entities; for example, between a symptom entity and a drug entity possible relations include ADE and Indication.

Relation extraction
Identifying and classifying relationships between named entities using a model.

Performance measures and dataset characteristics
True positives A true positive (TP) is an instance that is correctly classified as positive; for example, the correct identification of an ADE in a text.

False positives
A false positive (FP) is an instance that is incorrectly classified as positive; for example, the identification of an ADE in a text where the text does not mention an ADE.

True negatives
A true negative (TN) is an instance that is correctly classified as negative; for example, the correct identification that a text does not mention an ADE.

False negatives
A false negative (FN) is an instance that is incorrectly classified as negative; for example, the incorrect identification that a text does not mention an ADE when the text does refer to an ADE.

Confusion matrix
A table of TP, FP, TN, and FN that is used to cross-tabulate the true and predicted classes that is used in machine learning to show the performance of a classification model.

Accuracy
Also called classification accuracy.

Precision
Also called Positive Predictive Value (PPV).

Recall
Recall minimizes the impact of False Negatives and is a good metric for imbalanced data because it focuses on the minority class. Also referred to as sensitivity.

F1 score
The harmonic mean of precision and recall. The harmonic mean uses the reciprocals of the values and therefore minimizes the impact of large outliers; therefore, in order to have a high F1 score, you need to have both a high precision and a high recall.

ROC curve
This plot illustrates how well a binary classifier performs as its discrimination threshold is varied between 0 and 1. It compares the true positive rate and the false positive rate as the threshold changes. The area under this curve is often used in machine learning to quantify the discriminative ability of a model and to compare models.
(Continued ) merits a separate review of NLP methods applied to primary care and hospital care narratives. While these other types of data can contribute to the overall performance of an NLP pipeline for ADE detection, we chose to focus on the articles that use only clinical narratives from EHRs, to align closely with our aim. We applied the following eligibility criteria to our search results.
1. Articles that describe NLP application for ADE detection in clinical narratives in EHRs of hospitalized patients were included.
2. Articles that used a list of terms to search for ADEs in clinical narratives were excluded.
3. Articles with NLP where the underlying method was not described were excluded.
4. Articles describing clustering methods on unlabelled data were excluded.
5. Conference papers were included, but conference abstracts that were insufficiently detailed were excluded.
6. Articles that used literature databases, drug labels, or spontaneous reporting systems as the ADE data source were excluded.
7. Articles that used primary care or community care narratives were excluded.
8. Articles that combined clinical narratives with spontaneous reporting systems to detect signals in pharmacovigilance were excluded.

Data charting and critical appraisal process
The data extraction form was based on relevant items from the CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS) and Prediction model Risk Of Bias Assessment Tool (PROBAST) reporting guidelines [28,29]. In addition we included items recommended by Kersloot et al. [30] for the evaluation and validation of NLP algorithms. Feedback from MCS and JEK on a sample extraction was used to produce the final data extraction chart. RMM completed the data charting and critical appraisal. The data was charted in Microsoft Excel and summarized using Microsoft Excel and RStudio. (Continued)

Term Definition
Other terms

Imbalanced data
A dataset with data belonging to distinct classes, where the size of one class is much larger than the other class (the precise difference between the groups in order to qualify as imbalanced data is not explicitly defined, but often data with a ratio of 1 (or less) to 10 is considered imbalanced). Imbalance data are common in ADE research.

Internal validation
Quantifies the performance of a model on unseen data from the same population as the training data.

External validation
Demonstrates how well a trained model performs on an external dataset. Important because it shows how well a model generalizes to new data.

Overfitting and underfitting
Overfitting is where a model learns the training data too well, which negatively impacts its performance on new data and means it cannot generalize well. In contrast, underfitting is where a model does not sufficiently learn from the training data. An underfit model performs poorly on the training data and also cannot generalize well to new data. Ideally the aim is to select a model that neither under-nor overfits to the training data.

Shared task challenge
A competition in which a common dataset is provided to participants with the goal of applying a technique to achieve a specific task, usually with the aim to increase engagement with the problem.

Data items
The full list of items for which we sought to extract data is shown in Table 1. When extracting data relating to clinical involvement, we looked for any explicit mention in the manuscript of a clinical role (physician, pharmacist, nurse, medical student, or allied health professional) contributing in any way to the study.

Selection of sources of evidence
The final search yielded 1,550 articles. We removed 485 duplicates leaving 1,065 articles for title and abstract screening. We excluded 967 articles during title and abstract screening and 69 articles during full text screening, leaving 29 articles for data extraction. Fig 1 illustrates the selection process.
The main reason to exclude articles during the screening stages was that the data used in the article was not clinical narratives (n = 570). Other common reasons for exclusion were that the articles were not about ADEs (n = 243) or not about NLP (n = 77).

Characteristics of sources of evidence
The 29 included articles were published between 2011 and 2021 (see Fig 3a); note that we did not apply date filters to the searches. The journal category was identified as per the procedure outlined by Sheikhalishahi et al.

PLOS ONE
Adverse drug event detection using supervised natural language processing: A scoping review

Summary of main findings
To summarize, Table 2 outlines the number of articles reporting items in the framework for NLP workflow (Fig 2) that we created to appraise the papers.

Data description
Publicly available datasets were used in 15 (51.7%) of the studies, while 14 (48.3%) studies made use of data from their own institutions. The vast majority of datasets have a size of hundreds or thousands documents (see Table 3). Only one (2.6%) study used tens of thousands of clinical narratives, despite the fact that machine learning approaches are data hungry in the sense that their performance is strongly correlated with the amount of training data available [60]. The number of ADEs in the datasets (where reported) was in the hundreds (range 144-1,940, see Table 3).
Just over one third of the studies (n = 10; 34.5%) did not state the clinical domain or patient type studied. Of the studies that did report this information, oncology (n = 9; 31%) and critical care (n = 6; 20.7%) were the most studied clinical domains (see Fig 4). The most commonly specified note type was a discharge summary or discharge letter (n = 10; 34.5%).

Data preparation
3.6.1 Annotation. While some studies had access to labelled data (most notably those participating in the shared task challenges), ten studies (34.5%) reported annotating their own datasets. For the studies using entities, there were two approaches to defining ADEs. Some defined an ADE entity, and some defined an ADE as a relation between a drug entity and a non-drug entity (see Fig 5).
Two studies provided detailed accounts of creating a gold standard annotated corpus in a language other than English, and both made their annotation guidelines available. Oronoz  [33].
Five studies defined ADEs neither as entities nor relations, but at a higher level (sentence, document, or patient). Two studies annotated their data at the patient level, marking each patient as positive or negative for experiencing an ADE [36,42]. For the studies with

PLOS ONE
Adverse drug event detection using supervised natural language processing: A scoping review document-level classification, either documents or sentences were assigned binary labels indicating presence or absence of an ADE. 3.6.2 Pre-processing. Most of the studies (n = 25; 86.2%) reported some form of typical NLP pre-processing tasks including sentence boundary detection, tokenization, and part-ofspeech tagging. Some commented on difficulties encountered when applying off-the-shelf

PLOS ONE
generic pre-processing tools to clinical text. Dandala et al. observed that sentence boundary detection and tokenization are difficult issues in clinical text as sentence ends are frequently denoted by newline characters rather than punctuation [45]. This was echoed in another paper where it was noted that several generic sentence segmentation tools did not perform well due to differences in punctuation patterns and the use of newline characters in formatting [43]. Four studies overcame this by building their own custom tokenizer or sentence splitter [36,45,48,54].
Other data preparation tasks related to complexities in the data, such as class imbalance, duplicate sentences (due to copy-paste from previous notes), or overlapping entities. Class imbalance in particular was mentioned in several studies [33, 35, 39, 47,

Modelling
Tasks that were frequently described included named entity recognition (n = 17; 58.6%) and relation extraction and classification (n = 15; 51.7%). A diverse array of machine learning methods and models were reported in the studies. Several studies compared different methods or used ensembles of models in their tasks. Long Short Term Memory (LSTM) (n = 16; 55.2%) and Conditional Random Field (CRF) (n = 11; 37.9%) were the methods most frequently used. These methods are particularly suitable for NLP; LSTM because of its feedback connections which allow it to process sequences of information, and CRF because it does not assume that variables (in this case, words) are independent, and can therefore take context into account in making its predictions. Table 4 lists the methods.

Evaluation
3.8.1 Performance evaluation. Just over half (n = 16; 55.2%) of the articles reported the rationale for choice of performance metric, with metric used in challenge (n = 11; 37.9%) and standard or commonly used metric (n = 4; 13.8%) given as the main reasons. One article chose their metric given the unequal distribution of classes in the dataset, as the metric allowed them to highlight the positive class [51]. In total, nine (31%) articles reported an evaluation metric. Performance and evaluation methods for each article are described in Table 4.
The articles reported either precision and recall, or F1 score, or both. Some reported micro-averages of precision, recall, and F1 score which are useful when a system is applied to a multi-class classification problem, and gives an impression of the performance on individual classes (for example, ADE entity or ADE relation). Eight articles (27.6%) reported these performance measures under either strict or lenient matching or both; the other articles did not state whether strict or lenient matching was applied.
All articles reported the overall performance of their models across all prediction classes, whether it was entities (for example Diagnosis, ADE, Drug, Strength, Dose), relations (Drug-Dose, Drug-Symptom, Drug-Disease), or a combination of both (end-to-end performance). Additionally, some articles reported the performance on predicting just the ADE entity or ADE relation class; in all cases this was lower than the performance across all entities or relations (see Table 5).
Many methods were employed to account for model complexity such as hyperparameter tuning and regularization, and for measuring unbiased model performance, including k-fold cross validation. None of the articles indicated a cut-off value for determining a good performance in advance of performing the analysis.

Error analysis.
In total 12 articles (41.4%) report an analysis of errors made by their models. Table 6 provides examples of errors reported by at least two articles. Of those who gave details of an error analysis, five discussed possible changes to their methods on the basis of this analysis.

Deployment
None of the articles described implementation of their NLP application in clinical practice, but two did mention plans for implementation as part of future work [31,41]. To our knowledge, neither has published follow-up articles detailing a deployed system.

Main findings
We identified 29 studies that matched our inclusion criteria that reported on the application of NLP for the detection of ADEs. Our scoping review shows that at present the limiting step in creating NLP-based systems for ADE detection in hospitalized patients is data preparation including annotation and pre-processing of text. This seems especially problematic for languages

PLOS ONE
Adverse drug event detection using supervised natural language processing: A scoping review other than English. Also, although many off-the-shelf tools exists for data pre-processing, their usefulness for pre-processing clinical text is limited. These findings may explain the limited evidence of externally validated models or implementation of NLP applications in clinical practice. Although the included studies encompass diverse clinical domains, setting, narratives and methods used, LSTM and CRF (or a combination of these) methods are most frequently used in ADE detection from clinical narratives.

Business understanding
Just under one third of the studies explicitly reported clinical involvement. This involvement was limited to annotation of notes, annotation scheme design, and clinical chart review. None of the studies reported clinical involvement in areas such as overall design or interpretation of the results, although it is possible that this is a gap in reporting. These findings are similar to a recent review on clinical involvement in the development of machine learning clinical decision support systems, in which 21% of the studies on component development involved clinical experts in their process [66]. Simon et al. have strongly recommended that collaboration between technical and clinical teams is not only important but should be clinician-led when developing artificial intelligence solutions for medicine, as the two perspectives may not always agree [67].

Data preparation
Many of the difficulties encountered by the studies occurred at the data preparation stage when annotating and pre-processing the data. Off-the-shelf pre-processing tools perform poorly on clinical text for several reasons. The language is domain-specific, abbreviations and jargon are frequently used, words which can be inferred from context are skipped, and whitespace and new lines rather than punctuation are used in formatting [68]. This differs from the Table 6. Common errors described in error analyses.

Error Error description and example References
Intersentential relations missed Relation between drug and related entities missed due to entities in different sentences or long distance between entities in the text [33,34,43,45,49,50] "Haldol and Tradazone have been attempted at rehab without good effect and were discontinued due the drowsiness as well as (per ED report) some symptoms of lip smacking that were thought to be tardive dyskinesia."-relation between tradazone and tardive dyskinesia missed [43]

Entity confusion by model or annotator
Similar entities mislabelled as each other, such as dosage and strength, route and form, ADE and indication/ reason, ADE and sign/symptom [34,45,49,57,58] "She received one litre of normal saline"-annotators had difficulty determining if "one litre" is a Dose or a Strength entity [57] Omission in annotation Model predicts entity that is not annotated in corpus, or entity annotated in one instance but not in another instance [43,57,58] "Gabapentin 300 mg 3 times daily"-Frequency was missed during annotation [58] Failure to take account of attributes One of the entities is negated, speculative, or resolved but the relation is still identified [36,37,45] "no source of bleeding"-'bleeding' annotated as bleeding event [37] Inconsistent annotation Entity span boundaries differed within the corpus for the same entity [37,57] In one note "acute bleeding" annotated, but in another note only the word "bleeding" annotated [37] Entity missed Entity not identified due to misspelling or abbreviation [34,45] ". . .acute kidney injury due to genta"-drug entity missed due to use of abbreviation "genta" for gentamicin

Multiple entity labels apply to the same entity
Multiple labels can apply to an entity depending on the context of its related entities [45,58] "She was on furosemide and became hypotensive requiring norepinephrine"-"hypotensive" is an Indication for norepinephrine but an ADE for furosemide [45] https://doi.org/10.1371/journal.pone.0279842.t006

PLOS ONE
Adverse drug event detection using supervised natural language processing: A scoping review text used to train the standard pre-processing tools, which generally follow accepted grammar and punctuation rules. Some studies tackled this problem by building their own custom preprocessing tools. Other pre-processing tasks such as named entity normalization were only performed in studies where the language of the data is English. The majority of the tools available for this task are for English; therefore, researchers seeking to match their clinical narratives to standard terminologies in other languages face the additional barrier of building own tools from scratch. Joining efforts on national level in creating custom tools for clinical narratives in a specific language could partly circumvent this barrier [69]. Adapting tools that work well for English to another language could be another promising path [70]. Studies in which pre-processing tools for clinical narratives are compared are needed to support researchers in making a choice between the growing number of such tools [12,71]. The few studies that described in detail the creation of an annotated gold standard corpus describe a large effort to create relatively small datasets [33,39]. A 2020 review of studies on clinical NLP similarly noted that annotation is a bottleneck step in the use of clinical text data [60]. The effort required could be reduced by employing semi-automated methods to augment the annotation process. Such methods have demonstrated relatively small time savings of 13.85% to 21.5% per entity by employing dictionary-based pre-annotations, which were then checked (and corrected if necessary) by a human annotator [72]. Another study found that pre-annotations reduced the number of hand annotations necessary by 28.9% with consequent lower annotation time and higher inter-annotator agreement [73]. The availability of labelled data from the shared-task challenges greatly enhanced the efforts in applying NLP for ADE detection. Luo et al. [9] anticipated the significant impact of shared-task challenges in promoting and accelerating efforts in this area, the datasets from which were the basis for many of the articles included in this review. Just like for pre-processing tools, joining forces in creating annotated gold standard corpora for a specific task via shared-task challenges especially for languages other than English, should be encouraged.
The error analyses reported by the studies point to the importance and difficulty of the annotation task. Choices in annotation scheme design and accuracy of annotation scheme application both contributed to errors found in the studies, while other errors arose where annotators could not easily identify the correct clinical entity label to apply. These issues can be partly tackled by treating an ADE as a relation between a drug and non-drug entity and by making annotation schemes detailed and explicit. In particular, the representation of an ADE as an entity rather than as the relationship between a drug and non-drug entity led to avoidable errors. Where ADE is treated as an entity, the same symptom can be an adverse event in the context of one drug, but an indication in the context of another drug [45]; this makes it more difficult to accurately identify the ADE entity. For this reason we suggest that ADE should be treated as a relation between a drug and symptom entity, and not as an entity in itself. Treating an ADE as a relation between a drug and symptom entity reflects how clinicians think about such patterns in clinical data.
In their review of NLP in incident reporting and adverse event analysis, Young et al. noted that manual annotation is treated as a gold standard, yet the accuracy of the annotations determines the validity of the model accuracy measurements [74]. We agree that annotation accuracy plays an important role and also that annotation scheme design and choice of entity labels contributes to the validity of the results. The issues described in the studies could be tackled by involving clinicians in the process of designing and implementing an annotation scheme. Clinical narratives are designed to be interpreted by clinical readers, and clinicians have the requisite knowledge to interpret clinical text so as to derive maximum meaning from the annotated corpora. Clinicians also have the medical knowledge to apply these schemes correctly or to train lay annotators in their correct application. Clinicians can act as an interpreter between the written narrative and the process of automated extraction of data to ensure information is extracted as accurately as possible.

Modelling and evaluation
The included studies describe a variety of methods for the NER and RE tasks, and although variations of LSTM and CRF were most frequently seen, making a fair comparison between the reported methods is difficult. Factors such as dataset size, annotation quality and degree of model tuning affect performance of the methods. Additionally, variations on performance metric calculation and reporting make it difficult to compare between publications; although most papers reported the F1 score, the types of F1 score (micro-or macro-averaged) and the conditions under which it was reported (strict or lenient matching) varied, if they were reported at all. Indeed, even where the same type of metric appears to be reported, it may not have been calculated in the same way; a recent technical note highlighted that there are two different methods to calculate the macro-averaged F1 score which can differ in outcome by as much as 0.5 [75]. Both methods have each been described in a widely cited paper [76,77] and the choice of formula is seldom reported when using the macro-averaged F1 score. In addition, some studies did not focus on optimizing performance of their model, but on other factors relevant to ADE detection such as correcting for class imbalance, annotation scheme design, or corpus creation. The performance of the model also depends on the task, as a model may be well suited to RE but be less suited to NER. Comparison of optimized model performances for the same task, on the same dataset, and using the same evaluation script (such as occurs with the shared task challenges) is the most fair way to evaluate which models are suitable for the particular task of ADE detection in clinical narratives [23,24].
Overall performance of the systems was generally high but a steep drop in performance was reported when focusing on only the ADE entity or ADE relation class. This is because non-ADE entities such as drug names are relatively consistent in the data ("furosemide" will always refer to a drug name in the text) but this is not the case for ADEs, as "cough" can be an ADE in the context of lisinopril, but an indication in the context of codeine, or a symptom in the context of tuberculosis. This makes an ADE more complex to identify. Given that we are interested in detecting ADEs, the ability of systems to detect these ADE mentions in the text is more important than overall performance. When assessing performance it is therefore important to take into account performance on the ADE class and not just overall performance, as overall performance gives an artificially inflated impression of the ability to identify of ADEs in the text. This also reinforces our assertion that an ADE should be represented in data annotation as a relation between a drug and non-drug entities, to allow for accurate and consistent labelling of the data.
It is worth mentioning that in many of the studies where ADE is treated as an entity, the number of ADE entities can be over 1,000 [34, 43, 44, 56-58, 78, 79], and is usually higher than the number of notes included in the dataset. For example, the 2018 n2c2 challenge dataset used by seven studies [34,43,44,48,53,56,57] consists of 505 annotated documents. These 505 annotated documents contain 1,584 ADE entities (according to the challenge organizers [23]). At first glance this may seem like an extraordinarily high number of ADEs since the prevalence of ADEs varies between 1.9 to 57.9 ADEs per 100 patients [2]. However, as studies included in this review focus on NLP methods, several carefully preselected their datasets so that the notes would be more likely to contain ADE mentions [31, 33,37,40]; Henry et al. in describing the preparation of the n2c2 dataset ensured that each of the 505 discharge summaries include at least one ADE mention [23]. Therefore the number of ADE entities reported in the datasets does not necessarily reflect the number of ADEs in the studied patient populations. Additionally, none of the studies included in our review conducted a formal causality assessment between drug and adverse events found in the notes such as Naranjo probability scale or the World Health Organization Collaborating Center for International Dug Monitoring, the Uppsala Monitoring Center (WHO-UMC) criteria [80,81]. Also, in the study by Boyce et al. [31] as illustrated in Fig 1 notes were labeled as containing ADE mentions in cases bleeding and a drug known to cause bleeding were both mentioned in the same note, yet without the relation mentioned between them by the clinician. Both such practice and lack of formal causality assessment may explain the inflated ADE numbers. We strongly advise to use best practices for assessing whether or not ADEs are present in the clinical notes, in order to create a corpus to learn on. At present, most used best practice is a manual chart review by medical experts using formal causality assessment criteria [80]. Furthermore, although there are 1,584 ADE entities in the n2c2 dataset, the same dataset contains over 26,000 drug entities and over 80,000 entities overall, making the ADE entities a small portion of the total [23], which is similar in other datasets across the included studies. This disparity between the total number of entities/relations and ADE entities/relations made class imbalance a consideration in many of the studies [33,35,39,47,51,57,59].
While the proportion of papers reporting internal validation steps was high, only one paper reported external validation. Similar to the lack of external validation seen here, Spasic and Nenadic noted no hard evidence of generalizability or transferability in the studies included in their clinical NLP review [60]. Opportunities for external validation may be limited by the availability of data. Luo et al. in their 2017 paper noted that almost all of their studies focus on EHRs limited to within their own institutions [9]. We noted a shift in this trend, with a close to 50/50 split between the use of data from own institutions and from publicly available datasets. Shared tasks challenges such as n2c2 and MADE 1.0 in 2018 have increased the availability of labelled data, which are invaluable resources in this domain that can be used for external validation of other English language datasets.

Deployment
None of the included articles discussed the practical application of NLP models for ADE detection in the clinical setting. Boyce et al. state that their model could be deployed as a trigger tool to detect drug-related bleeding mentions in Emergency Department notes, after it had been externally validated [31]. To carry out any such implementation one would require not just buy-in from the EHR vendors, but also clinical involvement in the design and implementation of such a system. This is essential to ensure that any such systems aligns with the needs and workflows of clinicians and therefore is taken up and not a wasted investment. Researchers in the United Kingdom have demonstrated the utility of a deployed NLP system both in identifying patients for clinical trial participation and for converting clinical narratives to structured data in real-time (thereby removing the need for double data entry) [82]; more such projects are needed to generate interest and enthusiasm for the implementation of NLP systems to exploit the rich data hidden in clinical narratives.

Strengths and limitations
We created a framework for NLP workflow (Fig 1) based on CRISP-DM for this review and this framework can be understood as a supervised machine learning pipeline for NLP. The sequential steps of the pipeline can be applied to detect ADEs in clinical narratives.
We identified strengths and limitations of current research, and promising direction for future studies. Furthermore, both quantitative and descriptive data, including details of error analysis, are reported. This knowledge is crucial for data scientists to optimize performance of NLP pipelines for the task of ADE detection. It also helps clinicians and pharmacists to understand the value of NLP for their practice and how they could contribute to the development of more robust and clinically valuable NLP pipelines for ADE detection. In addition to standard literature databases MEDLINE and EMBASE, the pre-print server, arXiv, is included in our search strategy to look for state-of-the-art technical literature.
A limitation of this review is that we critically appraised the methods in the papers based on our own choice of tools. However, currently no validated assessment tool or reporting guideline specific to publications on clinical NLP exists. The upcoming artificial intelligence extension to the Transparent Reporting of a multivariable prediction model of Individual Prognosis Or Diagnosis (TRIPOD) statement and PROBAST tool (TRIPOD-AI and PROBAST-AI) [83] may provide more clarity on this issue, but at this time there is a gap for assessing clinical NLP models.

Future directions
Future work should investigate semi-automated methods to reduce the manual effort required to create annotated corpora to train NLP models, and examine how NLP can be deployed to detect ADEs in clinical practice. Making annotated corpora available for others to use will facilitate the training of data-hungry deep learning models and enable external validation. Furthermore, adding clinicians and pharmacists as team members when applying NLP for ADE detection should be a standard practice, since their expertise is needed to ensure high quality annotated data is created, and to elicit best suited strategies to implement NLP models into clinical practice. In order to assess the value of the future NLP pipelines, the reporting pratices must align with available and future reporting standards. Lastly, the recent advances in weakly supervised machine learning methods present new and exciting opportunities worth exploring to support NLP application for ADE detection, for example, helping in the annotation task [84].

Conclusions
The studies included in the review demonstrate that it is feasible to extract information on ADEs from clinical narratives using NLP. This is especially useful given that these data has the potential to be reused for multiple purposes, including routine ADE monitoring, clinical decision support, and research. Clinical involvement appears low and there is potential for clinicians to play an active role in the design and implementation of these types of systems. Performance on the ADE entity or ADE relation class is low compared to overall performance. The studies examined here demonstrate that multiple modelling approaches are available, but more work is needed in data preparation and deployment stages of the NLP process. Especially for languages other than English, extra barriers are present like lack of ontology-matching tools. When annotating corpora, treating an ADE as a relation between a drug and nondrug entity seems the best practice. Although the included studies encompass diverse clinical domains, setting, narratives and methods used, LSTM and CRF (or a combination of these) methods are most frequently used in ADE detection from clinical narratives.